OOV Detection and Recovery Using Hybrid Models with Different Fragments
نویسندگان
چکیده
In this paper, we address the out-of-vocabulary (OOV) detection and recovery problem by developing three different fragmentword hybrid systems. A fragment language model (LM) and a word LM were trained separately and then combined into a single hybrid LM. Using this hybrid model, the recognizer can recognize any OOVs as fragment sequences. Different types of fragments, such as phones, subwords, and graphones were tested and compared on the WSJ 5k and 20k evaluation sets. The experiment results show that the subword and graphone hybrid systems perform better than the phone hybrid system in both 5k and 20k tasks. Furthermore, given less training data, the subword hybrid system is more preferable than the graphone hybrid system.
منابع مشابه
OOV Word Detection using Hybrid Models with Mixed Types of Fragments
This paper presents initial studies to improve the out-ofvocabulary (OOV) word detection performance by using mixed types of fragment units in one hybrid system. Three types of fragment units, subwords, syllables, and graphones, were combined in two different ways to build the hybrid lexicon and language model. The experimental results show that hybrid systems with mixed types of fragment units...
متن کاملLearning Out-of-Vocabulary Words in Automatic Speech Recognition
Out-of-vocabulary (OOV) words are unknown words that appear in the testing speech but not in the recognition vocabulary. They are usually important content words such as names and locations which contain information crucial to the success of many speech recognition tasks. However, most speech recognition systems are closed-vocabulary recognizers that only recognize words in a fixed finite vocab...
متن کاملSpoken Term Detection for Persian News of Islamic Republic of Iran Broadcasting
Islamic Republic of Iran Broadcasting (IRIB) as one of the biggest broadcasting organizations, produces thousands of hours of media content daily. Accordingly, the IRIBchr('39')s archive is one of the richest archives in Iran containing a huge amount of multimedia data. Monitoring this massive volume of data, and brows and retrieval of this archive is one of the key issues for this broadcasting...
متن کاملOOV-detection in large vocabulary system using automatically defined word-fragments as fillers
The problem of unknown words has been addressed using automatically generated ller fragments which augment the lexicon and are incorporated in the language model. These fragments are used to reduce the damage on in-vocabulary words, to detect OOV regions and to provide a phonetic transcription for these regions. The performance of this technique has been evaluated in terms of damage reduction e...
متن کاملTHE JOHNS HOPKINS UNIVERSITY Sub-Lexical and Contextual Modeling of Out-of-Vocabulary Words in Speech Recognition
Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. We present a novel probabilistic model to l...
متن کامل